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Abstract 


The  goal  ot  this  project  is  the  development  ot  pattern-recogni¬ 
tion  and  signal-processing  methods  that  will  provide  indices  ot 
responsivitv  to  challenge  when  applied  to  Army-supplied  human  cardio¬ 
vascular  and  psvchomotor  data.  Time-series  and  point-process  tech¬ 
niques  will  torm  the  basis  ot  the  approach,  and  the  assumptions  that 
underlie  the  methods  will  be  examined  and  tested.  The  relationship  ot 
trequent  and  briet  events,  it  any.  to  the  indices  will  be  elucidated. 

This  report  presents  the  results  ot  the  work  over  the  past  year, 
which  has  proceeded  along  three  parallel  lines:  the  design,  implemen¬ 
tation,  and  testing  ot  data-preprocessing  steps  that  restore  physio¬ 
logic  integrity  to  noise-corrupted  data:  the  preliminary  implementa¬ 
tion  and  evaluation  ot  several  clustering  and  pattern-recognition 
methods:  and  the  selection  ot  a  data-segmentation  algorithm  tor  the 
partitioning  ot  time-series  data.  The  work  toiiowed  naturally  trom 
that  ot  the  previous  year,  in  which  we  reviewed  the  state  ot  the  art 
ot  the  understanding  ot  the  links  between  the  noninvasive  measurements 
described  here,  and  the  underlying  physiology. 

Plans  are  described  tor  the  third  year  ot  the  work,  which  will 
combine  those  separate  tasks  into  a  single  tool  tor  physiologic  state 
c harac  ter izat ion . 
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1.  Introduction 


The  study  ot  heart-rate  variability  <  HRV  >  has  become  ot  increas¬ 
ing  interest,  especially  in  the  10  years  since  the  Biological 
Engineering  Societv  held  a  meeting  on  the  subject  in  London.  In 
addition  to  the  studies  dealing  with  underlying  physiology,  analysis 
techniques,  and  applications  to  physiology,  there  has  been  work  in 
applications:  e.g..  the  estimation  ot  levels  ot  workload,  and  detec¬ 
tion  ot  mental  illness,  using  the  HRV. 

In  a  monitoring  environment,  where  an  individual's  ability  to 
pertorm  a  task  is  to  be  described,  it  is  important  to  have  a  rapid, 
unambiguous  measure.  Because  ot  the  relationship  ot  sleep  and  sleep 
deprivation  to  performance,  it  seems  reasonable  to  evaluate  the 
ettect  ot  adding  a  sleep-related  parameter  to  any  kind  ot  noninvasive 
measurement  system. 

Accordingly,  we  are  considering  in  this  work  the  combination  ot 
HRV  and  an  activity  measure  (recorded  with  an  actigraph  or  actometer) 
to  assess  performance.  The  ways  in  which  the  data  are  processed  and 
described  are  presented  in  the  following  report. 

2.  Scope  ot  Work  and  Progress  to  Date 

Our  approach  makes  use  ot  pattern-recognition  and  signal  pro¬ 
cessing  techniques  in  the  development  ot  methods  tor  classifying  human 
cardiovascular  and  psvchomotor  response  to  challenge.  Table  l 
presents  the  sequence  ot  tasks  which  constitute  our  effort.  Detailed 
information  about  the  topics  represented  by  each  box  make  up  the  bulk 
ot  this  report. 

It  is  important  to  note  that  a  number  ot  the  tasks  have  been 
proceeding  in  parallel.  Tasks  2.2.  2. 4  (development  ot  computer 
programs).  .  8  and  2.V  (development  ot  programs  and  performance 
evaluation  using  sample  data  from  outside  this  project)  are  essen¬ 
tially  complete.  The  programs  developed  there  are  now  ready  tor 
immediate  application  once  Task  2.i  (segmentation  ot  signals)  has  been 
completed . 

2.1  Data 

The  actigraph  and  R-R  interval  signals  supplied  by  the  Army  were 
each  broken  into  id  twentv-tour  hour  periods  which  are  further 
subdivided  into  halt-hour  tiles.  The  data  are  binary  with  no  end-ot- 
tiie  marks.  These  tiles  have  been  converted  into  ASCII  tiles, 
uploaded  onto  the  IBM  4141,  and  stored  on  a  tape.  A  VAX  L1//BU 
version  ot  that  tape  also  has  been  produced. 

The  data  come  from  a  study  ot  the  ettect  ot  different  doses  ot 
atropine  on  the  heart  and  physical  movement.  Two  males  in  their 
twenties  participated.  The  subjects  were  hospitalized,  ambulatory  but 
restricted.  During  the  experiment,  an  ECG  and  3n  actigraph  signal 
were  continuously  recorded.  The  subjects  wore  actigraphs  on  their 
right  wrists  which  recorded  a  voltage  signal  proportional  to  the 
amount  ot  g-torce  ot  the  wrist  in  the  lateral-medial  plane. 


TABLE  ! 

TASK  SEQUENCE 


Receive  raw  R-R  interval  and  ectigraph  waveforms 
from  the  Army. 


Remove  noise.  Base  algorithms  on  physiological 
criteria  and  known  instrumentation  artifacts.  Collect 
statistics  about  number  and  kinds  of  problems. 


Segment  actigraph  and  R-R  interval  signals. 


Extract  features  from  each  segment.  Work  in  both  the 
time  domain  and  frequrncy  domain.  Study 
correlation  among  features.  Choose  those  least 

correlated. 


Use  clustering  algorithms  to  learn  the  natural 
groupings  of  the  segments  in  feature  space. 


Use  segment  labels  to  be  provided  by  Walter  Reed 
to  evaluate  the  results  of  clustering. 


Reduce  the  number  of  features  using  dimensionality- 
reduction  techniques. 


Build  classifiers.  Examine  both  parametric  and 
nonperametric  models. 

I  "  1  —    


Test  classifiers  with  new  data  and  evaluate 
performance. 


The  experiment  was  conducted  over  5  non-cont iguous  days,  one  tor 
each  level  ot  atropine.  Atropine  was  given  about  30  minutes  atter  the 
start  ot  each  session,  either  in  an  intravenous  or  intramuscular  form. 
Table  2  outlines  the  protocol  used. 

Table  2 

EXPERIMENT  PROTOCOL 


Study  Schedule 

Subject  i 

Subject  1 

Dav 

Start  Time  Dose 

Start  Time  Dose 

i 

010/ 

0.0 

Ob  lb 

1 .0 

mg  IM 

* 

(J  /  1  / 

0.5 

me 

IM 

Ub4  / 

2.0 

mg  IV 

3 

Ob  35 

1  .0 

me 

IM 

0121 

0.5 

mg  IM 

4 

Ob  3  / 

2.0 

mg 

IM 

0112 

0.0 

0 

Ob  3  5 

2.0 

mg 

I” 

0101 

2.0 

mg  Im 

The  Artnv 

oertormed  the 

toiiowing  operations  on  the 

data 

beton 

providing  it  tor  analysis.  Both  the  ECG  and  actigraphv  signals  were 
converted  to  digital  torm.  On  playback,  the  actigraphv  signal  was 
amplified  (Oxford  Event  Demodulator  Amplifier  model  PM- 3)  and  filtered 
(bandpass  filter  I  .04-4  Hz  I  by  Coulbourn  Instrument  model  So- 3b).  The 
signal  was  sampled  with  a  programmable  digital  oscilloscope  (Norland 
model  3U01A  with  12bK-vord  butter  memory  tor  the  channel).  The  signal 
was  sampled  at  an  effective  rate  ot  15  Hz  and  the  RMS  value  ot  each 
3-sec  interval  stored  in  a  Corona  computer  running  under  MS-DOS  1.00. 
For  the  ECG  signal,  the  R-peaks  were  detected  using  a  maximum-slope 
detection  algorithm  with  a  real  sampling  rate  ot  400  Hz.  The  R-R 
intervals  were  stored.  Both  the  processed  ECG  and  actigraphv  data 
were  divided  into  halt-hour  tiles. 

2.2  Preprocessing 

Before  the  data  can  be  analyzed,  any  noise  produced  by  instrumen¬ 
tation  and  physiological  artitacts  must  be  removed.  Data  adjustment 
algorithms  have  been  written  to  process  R-R  interval  and  actigraph 
data.  Statistics  on  the  number  and  kinds  ot  problems  which  appear  in 
the  data  will  be  kept  tor  future  analysis  which  may  lead  to  more 
streamlined  data-adjustment  algorithms. 

2.2.1  R-R  Interval  Data 

Noise  can  be  introduced  into  the  ECG  data  in  three  ways:  (1) 
physiologic  artitacts,  (2)  tape-drive-induced  artifact,  and  (3) 
residual  instrument  noise.  To  identity  the  noise  in  the  signal,  we 
determined  an  acceptable  range  tor  heart  rate.  We  reviewed  the 
physiological  literature  which  indicated  that  a  normal  heart  rarely 
tails  below  a  resting  heart  rate  ot  3b  beats  per  minute  (bpm)  or  above 
20U  bpm  which  can  be  reached  during  extremely  vigorous  exercise.  A 


reasonable  acceptable  ranee,  therefore,  is  40  bom  to  idO  bpm  (corre¬ 
sponding  to  R-R  intervals  between  313  ms  and  141V  ms).  By  this 
standard,  the  ECG  data  are  tairlv  noise-tree.  On  average,  only  J"  ot 
the  intervals  in  each  30-minute  tile  are  out  ot  range. 

A  three-part  algorithm,  described  in  detail  below,  was  developed 
to  eliminate  the  noise  trom  each  ECG  tile.  The  tirst  part  truncates 
the  data  tile  to  eliminate  the  tape  stoppage  noise  which  appears  as  a 
cluster  ot  intervals  less  than  333  ms.  about  1  bO  intervals  trom  the 
end  ot  each  tile.  The  second  part  makes  adjustments  to  the  data  in  a 
conservative  manner.  The  algorithm  is  designed  so  that  most  ot  the 
correction  happens  here.  The  third  part  is  made  up  ot  two  sections, 
both  ot  which  are  designed  to  change  the  data  tile  just  enough  to 
allow  the  second  part  ot  the  algorithm  to  resume.  The  tlow  chart  ot 
the  toliowing  algorithm  can  be  tound  in  Appendix  1. 

At  the  beginning  and  end  ot  this  process,  information  about  the 
original  data  set.  the  difficulty  ot  the  noise-elimination  task,  and 
the  resulting  noise-tree  data  set  is  stored  tor  use  in  developing 
confidence  measures  needed  in  future  analysis.  Specifically,  the 
following  information  about  each  tile  is  kept: 

1.  the  amount  ot  time  truncated  trom  the  data  tile  which 
represents  bad  data  due  to  tape  stoppage. 

3.  the  number  ot  times  the  algorithm  reached  a  point  between 
the  second  3nd  third  parts,  a  measure  ot  how  difficult  it 
was  to  correct  the  data. 

3.  the  number  ot  intervals  less  than  333  ms. 

4.  the  number  ot  intervals  between  1  bOO  and  39VV  ms, 

b.  the  number  ot  intervals  greater  than  IVVV  msec.. 

t>.  the  number  and  lengths  ot  runs  ot  intervals  between  3  3  3  and 

14US  ms,  our  acceptable  range. 

After  truncation,  the  majority  ot  out-ot-range  intervals  in  each 
tile  tali  between  1  bOO  and  3yyy  ms.  Those  greater  than  3yyv  ms 
constitute  the  next  largest  group.  Intervals  less  than  333  ms  rareiv 
occur . 

2.2 A. I  Algorithm  Part  i:  Preprocessing  and  Elimination  ot  Noise 
Caused  by  Stopping  the  Tape 

The  goal  ot  this  part  ot  the  algorithm  is  to  cut  the  data  tile  at 
the  earliest  interval  in  the  cluster  ot  short  intervals  caused  by 
stopping  the  tape  during  processing.  The  method  used  is  a  search 
procedure  which  locates  this  point  by  using  our  knowledge  that  a 
cluster  ot  short  intervals,  occuring  more  densely  than  anywhere  else 
in  the  tile,  occurs  between  100  and  300  intervals  trom  the  end  ot  each 
tile. 

First,  five  non-overiapping  windows,  which  each  hold  tive 
intervals,  move  as  one  window  trom  the  end  ot  each  tile  toward  the 


beginning,  examining  one  interval  at  a  time.  When  three  out  ot  the 
live  windows  contain  at  least  one  interval  less  than  111  ms.  the 
search  stops.  We  want  our  chosen  interval  to  be  in  a  region  con¬ 
taining  many  short  intervals  and  not  to  be  an  isolated  point.  The 
interval  less  than  Hi  ms  which  is  closest  to  the  end  ot  the  tile  is 
selected.  From  that  interval,  we  jump  1 50  intervals  back  in  time  into 
a  section  ot  the  tile  which  should  be  near  the  cluster  ot  short 
intervals.  At  this  point.  2  windows  ot  15  intervals  each  are  created. 
They  are  moved  back  in  time,  in  a  step-wise,  non-overlapping  tashion. 
until  no  short  intervals  are  contained  in  the  iett  window.  When  this 
occurs,  the  tocus  switches  to  the  right  window  which  is  moved  one 
interval  at  a  time  toward  the  end  ot  the  tile  (i.e..  to  the  right), 
until  the  window  contains  at  least  three  intervals  smaller  than  Hi 
ms.  The  tile  is  truncated  at  this  earliest  short  interval. 

2.2. 1.2  Algorithm  Part  2:  Conservative  Approach 

For  the  remaining  discussion  ot  the  algorithm,  it  is  usetui  to 
visualize  an  interval,  with  a  lett  neighborhood  made  up  ot  the  inter¬ 
vals  which  precede  it  in  time,  and  a  right  neighborhood  that  contains 
the  intervals  which  tollow  it  in  time.  These  neighborhoods  will 
always  be  qualitied  by  a  number  which  is  the  size  ot  the  neighborhood 
(i.e..  the  number  ot  members  in  the  set  called  neighborhood). 

In  Part  2.  the  out-ot-range  data  values  (those  not  between  Hi 
and  1AVV  ms)  are  broken  into  three  cases.  The  tirst  is  called  short 
and  contains  ail  values  less  than  Hi  ms.  The  second  is  called  medium 
and  contains  the  intervals  between  1 5U0  and  I'VVy  ms.  The  large  case 
contains  all  intervals  greater  than  2'H'i  ms. 

Short  intervals  can  be  the  result  ot  equipment-caused  noise  or  an 
extra  systole  which  is  a  premature  contraction  ot  the  heart  origi¬ 
nating  at  a  site  other  than  the  usual  pacemaker.  An  extra  systole  can 
cause  the  cardiac  cycle  to  lengthen  slightly.  In  either  case,  a  talse 
R-oeak  has  been  detected.  This  short  interval  probably  belongs  to  one 
ot  its  immediate  neighbors  (it  it  has  been  caused  by  noise),  and  its 
sum  with  the  neighbor  should  be  close  to  the  lengths  ot  the  other 
intervals  surrounding  it.  It  it  is  an  extra  systole,  the  length  ot 
the  sum  will  be  somewhat  larger  than  that  ot  those  intervals  surroun¬ 
ding  it. 

•  Case  1.  which  deals  with  these  short  intervals,  requires  that 
three  intervals  in  the  acceptable  range  lie  on  either  side  ot  the 
short  interval.  This  short  interval  is  than  added  to  each  neighbor, 
and  these  two  sums  are  compared  with  the  mean  ot  the  six  closest 
neighbors.  The  sum  closer  to  the  mean  Is  chosen  and  accepted  as  the 
correct  interval,  it  it  is  less  than  15UU  ms. 

Medium  intervals  are  the  result  ot  system  noise.  The  average 
heart  interval  is  about  /  5U  ms,  and  ms,  the  longest  medium 
interval,  represents  tour  /5U-ms  intervals.  Our  strategy,  therefore, 
is  based  on  the  tact  that  these  intervals  represent  no  more  than  three 
missed  peaks. 

In  Case  2,  each  medium  Interval  is  divided  into  equal  sub-inter¬ 
vals  based  on  the  size  ot  the  median  interval  ot  its  surrounding 
neighborhood.  To  quality  tor  this  procedure,  a  medium  interval  must 
have  either  tour  contiguous  neighbors  on  each  side  in  the  range  ot  Hi 
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to  14VV  ms.  or  three  contiguous  neighbors  on  each  side  in  that  range. 
The  median  ot  those  neighbors  is  calculated. 

We  chose  not  to  use  an  unequal  number  ot  in-range  neighbors  trom 
each  side  in  order  not  to  bias  the  median  determination.  Four  and 
three  were  chosen  since  they  represent  enough  time  to  estimate  the 
rate  in  that  area.  Including  more  neighbors  would  give  intiuence  to 
intervals  too  tar  away  in  time  to  be  related.  It  has  been  noted  11,11) 
that  heart  rate  can  be  altered  by  the  sympathetic  system  within  a  tew 
beats  at  most  and  within  a  cycle  at  best. 

The  interval  is  partitioned  into  sub-intervals  the  size  ot  the 
neighborhood  median:  any  remaining  time  is  distributed  among  the 
sub-intervals.  At  this  point,  it  any  interval  is  greater  than  141W  ms 
or  the  remainder  is  greater  than  halt  the  mediam,  the  original 
interval  is  repartitioned  into  the  number  ot  previous  suhintervals 
plus  one.  and  the  new  remainder  distributed. 

Intervals  exceeding  ms  arise  when  three  or  more  consecutive 
R-peaks  are  missed:  we  call  this  situation  the  large  case.  Any 
information  about  heart  rate  acceleration  or  deceleration  would  be 
lost  it  this  larger  span  ot  time  simply  were  divided  into  equal 
pieces.  Another  strategy  has  been  chosen. 

Again,  two  contiguous  neighborhoods  ot  intervals  in  the  accep¬ 
table  range  are  required,  but  this  time  each  neighborhood  is  a 
"spanning  set"  made  up  ot  a  varying  number  ot  intervals  whose  total 
time  equals  or  just  exceeds  the  amount  ot  time  in  the  large  interval. 
From  these  spanning  sets,  we  calculate  a  iett  limit  and  a  right  limit 
tor  an  arithmetic  progression  that  is  used  to  divide  the  large 
interval.  When  the  variation  in  a  spanning  set  is  in  the  range  iJU's. 
we  use  the  mean  ot  the  set  as  the  limit  associated  with  that  spanning 
set.  Otherwise,  we  use  the  median,  which  is  less  attected  bv  extreme 
values . 

2.2.1. 3  Algorithm  Part  3 

The  program  iterates  through  Part  2  until  there  are  no  more 
out-ot-range  intervals  which  meet  the  requirements,  principally  that 
an  interval  must  have  intervals  in  the  range  ot  did  to  14VV  ms  on 
either  side  ot  it.  At  this  point,  Section  A  ot  Part  3  is  invoked.  All 
intervals  which  meet  its  criteria  are  now  adjusted.  Again,  the 
intervais  are  identified  as  short,  medium,  and  large  but  medium  and 
large  are  handled  in  the  same  way.  The  medium  and  large  intervals  are 
merged  into  the  class  medium/ large. 

Both  cases  in  Section  A  use  an  alternate  approach  trom  that  used 
in  Part  2  which  required  intervais  in  the  acceptable  range  to  be  on 
both  sides.  Here,  oniv  a  contiguous  string  ot  1U  in-range  intervais 
which  lie  on  one  side  ot  the  out-ot-range  interval  is  required.  Ten 
intervals  was  chosen  because  an  interval  much  farther  away  would  add 
little  information  about  the  true  nature  ot  the  heart  rate,  while  the 
tewer  might  contain  too  little  information  about  how  the  heart  rate  is 
changing.  The  areas  in  which  we  are  now  working  have  clusters  ot 
out-ot-range  intervais  (there  are  only  acceptable  intervals  on  one 
side),  implying  considerable  noise. 

For  the  short  case,  there  could  be  a  run  ot  ten  intervais  in  the 
acceptable  range  on  either  side  it,  in  Part  2,  the  interval  when  added 
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to  3  neighbor  was  greater  than  14W  ms.  Therefore,  it  is  necessary  to 
check  both  sides  ot  the  out-ot-range  interval.  The  short  interval  is 
added  to  either  nearest  neighbor  which  meets  the  criterion  ot  being  in 
a  run  ot  ten  and.  it  more  than  one.  the  new  value  closer  to  the  mean 
ot  its  associated  run  ot  ten  is  chosen  as  the  correct  interval. 
Otherwise,  it  only  one  run  ot  ten  was  tound.  the  sum  ot  the  interval 
and  its  near  neighbor  is  accepted.  This  new  interval  must  be  less 
than  l oUU  ms. 

The  medium/large  case  is  a  variation  ot  the  large  case  in  Part 
Part  2  bridges  the  gap  created  by  the  out-ot-range  value  by  tilling  in 
values  with  an  arithmetic  progression  run  trom  the  mean  ot  the  iett 
neighborhood  to  that  ot  the  right.  In  the  medium/ large  case,  however, 
since  we  have  already  identified  a  run  ot  ten  intervals  in  the 
acceptable  range  on  one  side,  we  have  one  ot  those  neighborhood 
statistics.  What  we  do  not  have  is  something  tor  the  other  end  ot  the 
progression . 

Our  solution  works  only  with  the  five  intervals  closest  to  the 
out-ot-range  interval,  among  the  run  ot  ID.  One  limit  is  the  median  ot 
that  neighborhood  ot  five.  It  serves  as  one  end  ot  the  arithmetic 
progression.  The  other  end  ot  the  progression  is  the  average  ot:  (1) 
twice  the  mean  ot  the  neighborhood  ot  five:  (2)  the  mean  ot  whatever 
intervals  in  the  acceptable  range  are  tound  within  five  intervals  on 
the  other  side.  Whatever  close,  in-range  intervals  exist  on  the  other 
side  should  have  some  limited  influence  on  the  nature  ot  the  arith¬ 
metic  progression.  After  the  arithmetic  progression  is  calculated, 
the  remaining  time  is  distributed  uniformly. 

It  no  out-ot-range  intervals  could  be  changed  in  Section  A. 
Section  B  is  used  because  someching  has  to  be  altered  to  allow  Part  2 
to  resume.  Hence,  only  one  or  two  out-ot-range  intervals  are  changed. 
The  two  cases  are  again  short  and  medium/ large. 

First,  the  longest  run  ot  in-range  intervals  in  the  entire  tile 
is  identified.  The  medium/ large  out-ot-range  values  at  either  end  are 
chosen  first.  The  medium/ large  case  is  similar  to  that  tound  in 
Section  A.  One  end  ot  the  arithmetic  progression  is  the  median  ot  the 
five  closest  in-range  values  or  however  many  in-ranee  values  there  are 
in  the  run.  The  other  end  is  the  mean  ot  that  five  (or  howevermanv 
in-range  values  there  are  in  the  run),  modified  by  any  in-range  values 
lying  within  five  ot  the  medium/ large  value.  Any  remainder  after  the 
progression  has  been  calculated  is  uniformly  distributed. 

A  short  interval  is  adjusted  only  it  nothing  has  been  chanced  by 
the  Section  B  medium/ large  algorithm,  because  this  procedure  is  the 
most  arbitrary.  The  short  interval  is  simply  added  to  its  smaller 
neighbor  and  no  test  tor  variability  or  size  is  made. 

2.2.2  Noise  Reaovai:  Actigraph  Data 

Each  data  tile  has  an  offset  which  must  be  subtracted  to  produce 
a  zero-minimum  signal.  The  signals  are  otherwise  quite  clean. 
These  are  true  time-series  data,  ot  fixed  length  per  tile. 
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2.5  Segmentation 

Once  the  noise  has  been  removed,  we  will  subdivide  the  signals 
into  pieces  at  points  where  the  nature  ot  the  signal  changes.  Since 
each  24-hour  R-R  interval  signal  has  a  companion  24-hour  actigraph 
signal,  several  approaches  are  available.  First,  the  R-R  interval 
signal  and  the  actigraph  signal  may  be  segmented  separately  and  the 
correspondence  ot  their  boundaries  examined. 

A  second  approach  is  a  hierarchical  one  in  which  the  segmentation 
ot  one  signal  would  determine  the  boundaries  ot  the  other.  The 
hierarchical  technique  will  be  examined  during  the  next  year.  The 
actigraph  signal  will  guide  the  process  because  it  represents  only 
activity,  a  simpler  physiological  event  than  heart  rate  which  contains 
many  components  like  respiratory  sinus  arrhythmia  (RSA)  and  bio¬ 
rhythms.  Statistics  on  the  number  and  lengths  ot  segments  will  be 
kept  to  aid  in  the  evaluation  ot  the  method. 

The  segmentation  algorithm  we  have  chosen  I  1  I  looks  tor  differ¬ 
ences  in  the  parameters  between  two  segments  ot  the  signal  which  have 
been  modeled  as  autoregressive  (AR)  processes  ot  the  same  order,  and 
fixes  a  boundary  between  dissimilar  pieces. 

The  approach  is  to  model  the  finite-duration  random  time  series 
by  a  stationary,  normally-distributed  autoregressive  process  ot  order 
p.  Stationarity  means,  qualitatively,  that  the  graph  ot  the  time 
series  looks  about  the  same  near  one  time  as  near  another.  More 
formally,  all  statistical  properties  ot  stationary  time  series  remain 
unchanged  when  the  period  ot  observation  is  shitted  forward  or 
backward  in  time.  In  particular,  the  mean  and  the  variance  (as  well 
as  the  higher-order  moments)  do  not  change  with  time,  and  the  autoco¬ 
variance  between  two  values  separated  by  i  time  units  depends  only  on 
t.  Although  many  real  time  series  may  not  fulfill  those  conditions 
perfectly,  the  tools  that  are  derived  under  the  assumption  ot  station¬ 
arity  often  work  quite  well  on  those  series. 

By  autoregressive  ot  order  p,  we  mean  that  the  series  (  r(  t  U  . 
t*1.2 . .V  can  be  written 

p 

^ 'a  rl t-i )  *  a  u(  t ) 

L* 

i*0 


where  3(j*l  and  aj  .  a?,...,aD  are  coefficients  that  aiiow  r(  t )  to  be 
expressed  in  terms  ot  the  p  previous  values  ot  the  series.  The  error, 
or  disturbance,  term  <xu(t)  is  assumed  to  be  an  uncorreiated  stationary 
normally-distributed  series  with  variance  . 

Standard  methods  (4J  exist  tor  estimating  the  parameters  a-^,  and 
it  two  time  series  r(t)  and  s(t)  exist,  it  is  possible  to  compute  the 
joint  likelihood  (probability)  ot  (r(t)I  and  <s(t)J  conditioned  on  the 
first  p  observations  ot  each  sequence.  We  thus  can  find  the  maximum 
likelihood  under  the  null  hypothesis  that  the  parameter  sets  are  equal 
and  compare  it  to  the  maximum  likelihood  under  the  alternative 
hypothesis  that  the  parameters  are  arbitrary.  A  maximum-iikeiihood 
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ratio  then  can  be  t  ound  that  can  be  transformed  into  a  distance 
measure  d  such  that  d=U  tor  sequences  having  identical  parameter  sets. 
This  measure  is  derived  in  111  as  follows: 

The  joint  likelihood  ot  lr(t)f  and  ls(t)f  is 

~K  -n*  , 

1  =  (u  (o  v'liJT)  exo 

R  S  ‘  | 

where 

Nr=*  length  ot  |  r(  t )  I 
Mg*  length  ot  I s( t )  > 
p  =  order  ot  process 
‘m'r  =  NR  -  p 
N‘s  =  »S  -  P 

Cr  and  Cg  are  the  covariance  matrices  ot  I r(  t ) f  and  (s(t)f. 
respectively 

a  =  (  ayaja_>. .  .aD) 


R  T  „ 

2  -R  CR  -R  “ 
-aR 


~  S  T  \ 

2o's  ~S  S'S  t 


Let  1  denote  the  maximum  likelihood  under  the  null  hypothesis  that 

a  *  a  and  a  =  <J  .  Thus  1..  can  be  written  as 
”  K  —  b  K  b  U 
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where  a 


T 

a  C  a 


and 


a  *  nooied  estimate  ot  r. 

P 

Similarly  let  ./^denote  the  maximum  likelihood  under  arbitrary  para¬ 
meters  settings.  Y^is  written  as: 
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In  eenerai 


+  N‘  in  <J  '  (I) 

S  b  J 

with  o~  and  C  being  computed  separately  tor  R.S.  and  P. 

Thus  d  is  a  measure  ot  the  statistical  diilerences  between  the 
two  signal  segments.  The  larger  d  is.  the  more  the  parameters  ot  the 
segments  are  expected  to  be  ditterent.  Because  a  logarithm  was  used 
in  the  derivation  ot  d  to  make  it  zero  tor  identical  sets,  it  is 
called  an  "entropy  distance'1  by  Chen  13). 

The  order  p  ot  the  autoregressive  model,  the  size  ot  the  window 
w.  and  the  threshold  d c ^ .  above  which  the  segments  are  considered 
ditterent.  must  all  be  estimated.  A  suggestion  tor  estimating  the 
threshold  dth  is  to  construct  a  histogram  ot  the  entropy  distances 
which  have  been  calculated  tor  ail  adjacent  pairs  ot  segments  ot  the 
signal  alter  it  has  been  initially  segmented  into  equal-length  pieces, 
each  ot  size  w.  Thev  observe  that  such  a  histogram  seems  to  have 
Chi-square  distribution  which  might  help  in  selecting  a  cut-oil.  It 
is  noted  la!  that  w  2.  (p/3)-  where  p  is  the  order  ot  the  underlying  AR 
process,  because  "...in  anv  ergodic  time  series  where  statistical 
parameters  sre  calculated  as  time  averages  a  minimum  interval  ot 
length  L  is  necessary  to  estimate  the  statistical  parameters  with 
sufficient  accuracy"  Ip. 31).  In  general,  however,  selecting  p.w.  and 
dth  requires  experience  and  a  general  understanding  ot  the  character¬ 
istics  ot  the  signal. 

Our  segmentation  algorithm  consists  ot  a  broad  initial  search  tor 
an  optimum  boundary  followed  by  3  specific  point-by-point  search. 
These  two  parts,  themselves,  are  each  broken  down  into  two  sub¬ 
sections.  Part  1  begins  with  Rough  Boundary  Searching,  followed  by 
Optimum  Boundary  Searching.  The  final  search,  carried  out  in  Part  2. 
takes  one  ot  two  terms.  The  choice  is  based  on  the  location  ot  the 
currently  selected  optimum  boundary  point. 

'2.J.I  Part  1 

2. 3. 1.1  Rough  Boundary  Searching 

Once  w.  the  window  length,  and  p.  the  order  ot  the  autoregressive 
process,  have  been  selected  the  signal  is  partitioned  into  w- length 
segments,  each  labeled  as  in  Figure  1.  The  last  point  in  each  segment 
s^  is  called  a  node  and  labeled  n^.  For  consistency,  the  tirst 
segment  begins  at  point  2  ot  the  signal,  and  point  1  is  labeled  n<j. 


d  =  (  S' '  +  N  ’  )  In  0  N '  In  rrn 
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This  crude  segmentation  ot  the  signal  is  retined  in  the  remaining 
sections  ot  the  algorithm.  It  is  important  to  note  that  the  number  ot 
segment  boundaries,  which  is  initially  a  function  ot  the  window  length 
w.  is  a  fixed  upper  limit.  The  number  ot  boundaries  can  only  de- 
c  rease . 

The  entropy  distance  between  each  pair  ot  adjacent  segments  s^ 
and  s^+L  is  calculated  using  Eo .  (  1  )  .  and  that  value,  labeled 
d(Si.  s^  +  i  )  .  is  associated  with  the  intervening  node  n^. 

An  entropy  distance  threshold  dth  is  calculated  tor  the  data. 
This  threshold  is  used  during  Optimum  Boundary  Searching. 

For  the  remainder  ot  the  discussion  ot  the  segmentation  algor¬ 
ithm.  the  tirst  point  in  the  data  tile  will  be  referred  to  as  the 
lettmost  point  and  the  last  as  the  rightmost  point. 

2.J.1.2  Optimum  Boundary  Searching 

Locate  the  rightmost  dts^.s^+j)  which  is  greater  than  dth.  the 
threshold  distance.  Ail  the  nj  associated  with  each  d(s^.Sj;  +  1)  < 
are  no  longer  boundaries  because  the  segments  and  s^i  are  consi¬ 
dered  statistically  similar  since  their  entropy  distances  are  below 
the  threshold.  It  no  d(s^.s^+^)  is  greater  than  dth*  then  the  chosen 
threshold  should  be  re-evaluated. 

The  rightmost  d(s^.s^+l)  2  dc^  is  associated  with  node  n^  and  is 
called  the  current  optimum  boundary  between  the  two  segments.  When 
the  characteristics  ot  the  signal  change  between  any  two  consecuive 
segments,  the  entropy  distance  between  those  segments  will  exceed  the 
threshold.  A  finer  search  is  Chen  conducted  in  Sj+j  tor  a  better 
optimum  boundary.  The  iett  segment  is  not  searched  at  this  time. 
It  will  be  searched  it  the  final  optimum  boundary  is  to  the  left  ot 
Si+l  after  the  completion  ot  Part  2  ot  the  algorithm.  We  divide 
into  equal  subsegments  ot  length  ws .  The  nodes  are  labeled  in  the 
same  manner  as  before:  Node  is  relabeled  my  and  the  following 

points  are  m^.mu . mn.  The  last  subsegment  mn  is  also  labeled  n^+ j . 

The  labeling  ot  this  section  ot  the  signal  is  shown  in  Figure  2. 
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Two  pairs  ot  windows  <s^.s^+i)  and  ts'{,s‘(+1)  are  created. 
(Initially.  sj_  =  Sj_.)  They  are  shown  in  Figure  3.  We  call  s{  and 
s'l  test  windows,  and  sj  +  j  and  s'{+^  reference  windows  Id).  A 
window  must  be  at  least  w  points  lone. 
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These  windows  move  through  in  ws  increments  in  a  search  tor 
a  better  optimum  boundary  than  the  currently  selected  one.  The 
process  will  stop  after  mn_{  is  considered.  Stopping  at  mn_j  rather 
than  at  mn  is  a  departure  from  Chen's  specitication .  It  was  done  to 
simplify  the  point-by-point  search  carried  out  in  Part  i.  The  sizes 
ot  the  windows  will  vary  over  time.  The  only  conditions  are  (1)  that 
both  test  windows  always  start  at  n^.  (  2 )  that  s'i+i  is  always  w 
long  (the  minimum  window  length),  and  (  3)  that  both  reference  windows 
always  end  at  the  same  point. 

At  the  start,  the  windows  are  positioned  as  in  Figure  3.  Test 
windows  s^  and  s‘[  begin  at  n^_[.  These  test  windows  end  at  the 
same  point  at  which  their  reference  windows  begin:  namely  sj+j  at  m^ 
and  s'{+i  on  m^ .  The  junction  ot  both  s'  windows  is  3iwa_vs  the 
current  optimum  boundary  and  is  labeled  b^.  Therefore,  m^  *  b^.  The 
junction  ot  both  sa  windows  is  always  labeled  b^+j .  Therefore,  m |  * 
bi+j.  Both  reference  windows  end  at  the  point  bi+t  +  w. 

The  entropy  distance  between  the  s'  windows  is  calculated  again 
using  Eq . <  1 ) .  The  value  is  associated  with  b^.  The  entropy  distance 
between  the  s“  windows  is  likewise  calculated  and  associated  with 
bj[+1.  The  point  associated  with  the  larger  entropy  distance  is 
chosen  as  the  current  optimum  boundary.  It  the  distances  are  the  same 
then  the  current  optimum  boundary  is  not  changed. 
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The  end  ot  s ^  is  now  moved  to  the  current  optimum  boundary  and 
the  end  ot  s'{  is  moved  a  distance  wg  tram  its  former  position.  The 
label  b|  again  is  associated  with  the  optimum  boundary  and  the  end  ot 
s  i  •  The  end  ot  s'[  is  associated  with  b^+[.  It  is  necessary  to 
check  that  s^+i  and  s '{ + A  both  end  at  b^+i  +  w. 

A  new  pair  ot  entropy  distances  is  caicuiated.  the  larger 
selected,  and  the  windows  moved.  The  procedure  is  toiiowed  over  aii 
subsegments  up  to  and  inciuding  mn_|.  The  tinai  optimum  boundary  is 
located  at  bj_. 

1.5.2  Part  2 


The  search  continues  to  the  left  and  right  ot  b^  in  a  point-bv- 
point  manner.  The  range  to  be  covered  is  ( b^  -  ws,  +  ws).  and  the 
search  is  carried  out  in  the  same  manner  as  before.  The  point  -  ws 
is  considered  the  current  optimum  bondarv.  The  only  difference  is 
that  when  s'-[  changes  position,  it  moves  up  only  one  point.  Figure  4 
shows  the  arrangement  ot  the  windows. 
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Figure  4 

After  ail  points  are  examined  and  it  *  my  *  nn,  then  the  point 
associated  with  b^  is  accepted  as  the  optimum  boundary.  However,  it  b^ 
*  my  =  nn.  then  the  point  at  b^  is  accepted  oniv  it  the  tinai  is 
either  at  my  or  to  the  right  ot  my.  It  is  to  the  left  ot  my.  then 
it  is  necessary  to  look  at  the  entropy  distance  associated  with  the 
previous  pair  ot  windows  d( sj-j  ,  s^).  It  d  (  s  £ . s  ^ )  2  then  no 
boundary  is  accepted.  It  d(s^_£,si)  <  dj-fpthen  we  consider  d(s^_^.s^) 
to  be  greater  than  dtf,  and  search  tor  an  optimum  boundary  in  those 
segmenets . 

The  algorithm  is  repeated  tor  all  dls^s^.^)  2  d t ^ . 

The  limitations  ot  this  algorithm  are  minor.  First,  the  data 
tile  initially  must  be  segmented  into  equal  pieces.  Rarely  will  this 
division  come  out  equally,  so  some  points  at  the  end  ot  the  tile  will 
be  lost.  Second,  the  first  segment  will  not  be  searched  because  only 
is  considered,  and  similarly,  the  last  segment  will  never  be 
searched  because  ot  the  requirement  that  must  be  u  points  long. 

However,  these  lost  points  represent  oniv  a  small  traction  ot  total  in 
the  tile. 


2.4  Feature  Extraction 


Features  will  be  extracted  trom  each  segment.  The  goal  ot 
feature  extraction  is  the  characterization  ot  each  segment  by  a  set  ot 
measurements  (the  feature  rector)  that  are  invariant  in  the  presence 
ot  noise  or  sampie-to-sampie  ditterences.  Ideaiiv,  the  feature 
vectors  will  be  similar  tor  segments  ot  the  same  type,  and  easily 
distinguished  trom  each  other  tor  dissimilar  segments.  The  notion  ot 
feature  extraction  trom  a  sample  may  be  thought  ot  conveniently  as  the 
location  ot  a  point  in  an  /;-dimens  ional  space.  The  coordinates  ot 
the  point  are  the  values  ot  the  n  features,  and  the  label  ot  the 
point  is  its  class.  Thus,  we  intend  that  non-identical  segments  ot 
the  same  class  will  have  representations  which  are  close  to  each 
other.  The  pattern  recognition  problem  then  is  equivalent  to  the 
construction  ot  boundaries  in  the  /?-dimens  ion  a  1  space  that  result  in 
the  separation  ot  groups  ot  points  on  one  label  trom  groups  ot  points 
with  other  labels.  Ideally,  each  group  would  contain  points  all 
having  the  same  label. 

Our  candidate  teaures  will  include  time-domain  measurements 
(e.g..  the  number  ot  zero-crossings,  amplitude-based  measures 
I  mean-square ,  other  moments,  histogram  shape],  inter-peak  intervals 
and  slopes,  autoregressive  (AR)  parameters,  and  frequency-domain 
measurements  (e.g..  shape  ot  the  power  spectrum:  location  ot 
maximum-oower  band,  number  ot  maxima). 


2.b  Clustering 

Clustering  seeks  to  partition  a  given  data  set  into  homogeneous 
subsets  (clusters)  bv  considering  similarities  ot  data  points  (feature 
vectors)  in  each  subset  and  their  relationship  to  the  elements  ot 
other  subsets.  Typical  similarity  measures  are:  the  Euclidean 
distance,  the  city-block  distance,  the  Minkowski  metric  (a  generalized 
Euclidean  distance),  and  a  quadratic  distance  function  I6J.  The  use 
ot  such  metrics  as  similarity  measures  can  be  justified  by  the 
heuristic  argument  made  above  that  points  in  the  same  cluster  should 
be  close  to  each  other  and,  at  the  same  time,  distant  trom  the 
elements  ot  other  clusters. 

There  are  basically  two  approaches  to  clustering.  The  first, 
known  as  the  dynamic  clustering  method ,  uses  an  iterative  algorithm 
to  optimize  a  clustering  criterion  function.  Various  criteria  ot 
clustering  have  been  suggested  in  the  literature.  Among  these,  the 
most  useful  have  proved  to  be  the  family  ot  functions  that  quantity 
the  average  attinitv  ot  data  points  to  cluster  representatives.  At 
each  iteration  ot  a  dynamic  clustering  algorithm,  data  points  are 
assigned  to  clusters,  the  number  ot  which  must  be  specified  in 
advance.  The  assignment  is  performed  on  the  basis  ot  the  points1 
similarity  with  the  current  cluster  representatives.  In  subsequent 
steps,  the  cluster  representatives  are  updated  to  reflect  any  changes 
in  the  data-point  assignments.  Those  new  cluster  models  are  used  in 
the  next  iteration  to  reclassify  the  data,  and  the  process  is 
continued  until  a  stable  partition  is  obtained. 


iy 


A  second  approach,  known  as  hierarchical  clustering,  is  non-iter¬ 
ative.  At  any  stage  ot  a  hierarchical  clustering  algorithm  the  two 
most-similar  existing  clusters  are  merged,  thus  reducing  the  number  ot 
potential  clusters  by  one.  Alter  n-l  steps  where  n  is  the  cardinality 
ot  the  set  being  analyzed,  the  algorithm  terminates.  The  number  ot 
clusters  in  the  data  set  need  not  be  known  a  priori.  Rather,  natural 
clusters  ot  points  in  the  data  set.  tor  a  given  measure  ot  simiiaritv. 
are  detected  by  assessing  the  changes  in  the  values  ot  the  measure  at 
various  stages  ot  the  algorithm. 

A  number  ot  very  good  algorithms  ot  both  types  are  documented  in 
the  literature  I/.8I.  These  algorithms  will  be  evaluated  using  the 
features  extracted  as  indicated  above. 

2 .6  Evaluation 

At  this  stage  ot  the  work,  the  Army  will  provide  the 
phvsioiogicai-state  labels,  and  the  times  at  which  they  begin  in  each 
ot  the  ten  24-hour  signals.  This  information  will  permit  evaluation 
ot  the  segmentation  method  and  ot  the  several  clustering  methods.  It 
the  algorithms  are  working  well,  then  (I)  the  partitions  tound  in  this 
work  will  agree  with  the  beginning-points  ot  the  states  provided  by 
the  Army,  and  (2)  most  ot  the  points  in  a  given  cluster  will  have  the 
same  label. 

2.)  Feature  Selection 

The  features  chosen  will  be  prewhitened  both  approximately  (by- 
removing  ail  but  one  ot  the  features  making  up  a  set  ot 
highly-correlated  features)  and  exactly  (by  a  diaeonaiization  ot  the 
features'  correlation  matrix).  The  resulting  sets  will  be  evaluated 
in  several  ways: 


(1)  bv  the  probability  ot  error  associated  with  their  use  in  a 
classifier  (see  below) 

(2)  by  the  Karhunen-Loeve  transformation  (which  uses  the 
eigenvalues  ot  the  features’  covariance  matrix  to  rank  them  according 
to  their  intrinsic  ability  to  separate  the  samples);  and 

(2)  by  the  homogeneity  ot  the  clusters  tound  in  Sec.  2.0. 

2.8  Classifier  Design 

2.8.1  Parametric  Methods 

2. 8. 1.1  Introduction 

The  purpose  ot  pattern  recognition  is  to  determine  to  which 
category  or  class  a  given  sample  belongs.  Feature  extraction  provides 
a  set  ot  numbers  which  make  up  the  observation  vector.  The 
observation  vector  serves  as  the  input  to  a  decision  rule  by  which  we 
assign  the  sample  to  one  ot  the  given  classes.  Let  us  assume  that  the 
observation  vector  is  a  random  vector  whose  conditional  density 
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function  depends  on  its  class.  It  the  conditional  density  function 
tor  each  class  is  known  (this  is  the  parametric  case),  then  the 
pattern  recognition  problem  becomes  a  problem  in  statistical 
hypothesis  testing. 

Here  we  discuss  the  two-class  problem,  which  arises  because  each 
sample  belongs  to  one  ot  two  classes.  or  <.•>■>.  The  conditional 
density  functions  and  the  .a  priori  probabilities  are  assumed  to  be 
known.  See  Appendix  F  tor  variable  definitions. 

2.8. 1.2  The  Bayes  Decision  Rule  for  Minimum  Error 

Let  X  be  a  feature  vector,  and  let  it  be  our  purpose  to  determine 
whether  X  belongs  to  wj  or  u,  .  A  decision  rule  based  simply  on 
probabilities  may  be  written  as  follows: 


PU  /X)  <  P(..,.,/X) 


X  t 


!  “’l 

i 


(  2 . 8 . 1  - 1  ) 


The  a  posteriori  orobabi  1  it  ies  P(w  /X)  mav  be  calculated  from  the  a 

i 

priori  orobabi  lities  P(  m  .)  and  the  conditional  densitv  functions 

l 

d(X/w.).  usine  Baves'  theorem,  that  is 
l 


P(w.  /X) 

l 


o(  X/  w  .  )  P<  w .  > 

J _ 1 _ l_ 

p<  X ) 


Since  p(X)  is  common  to  both  sides  ot  the  inequality  (2.8. 1-1).  the 
decision  rule  ot  (2.8.  1-1)  can  be  expressed  as 


d(  X/u>  )  P(  w  )  <  p(  x/u. ,  )P<  W., )  -*  X  c  •: 

1  1  2  (  w., 


or 


i(X)  = 


D(  X/W1  ) 
o(X/u>.,) 


P(  w.,  ) 
P(  w,  ) 


->  X  £  <’ 


w., 


( 2.8. 1-2 ) 


The  term  1(X)  is  called  the  likelihood  ratio  (related  to,  but  not  the 
same  as,  that  used  in  segmentation)  and  is  the  basic  quantity  in 
hypothesis  testing.  We  call  P(<ov)/P(wj)  the  threshold  value  ot  the 
likelihood  ratio  tor  the  decision.  Sometimes  it  is  more  convenient  to 
write  the  minus-iog-iikeiiihood  ratio  rather  than  writing  the  likeli¬ 
hood  ratio  itself.  In  that  case,  the  decision  rule  (2.8. 1-2)  becomes 


-In  1(X) 


-  In  p (X/Wj)  +  In  p(X/u>.;)  >  In  <  P<  «  >  /  P(  <i>y  >  )• 


X  £ 


I  w-. 


( 2.8. 1-3) 
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The  direction  ot  the  inequality  is  changed  because  we  have  used  the 
negative  logarithm. 

Equation  (  _’ .  8.1-2)  or  <2. 8. 1-3)  is  called  the  Bayes  test  tor 
minimum  error. 

In  general,  the  decision  rule  ot  (2.8. 1-3).  or  any  other  decision 
rule,  does  not  lead  to  perfect  classification.  In  order  to  evaluate 
the  performance  ot  a  decision  rule,  we  must  calculate  the  probability 
ot  error,  that  is.  the  probability  that  a  sample  is  assigned  to  the 
wrong  class. 

Let  l  [  and  l  _>  be  the  regions  in  the  domain  ot  X  such  that 
P(  w  [  /  X  )  >  P(  iu7  /  X  )  and  P(  <u  j  /  X  )  <  P(  to  v  /  X  >  ,  respectively.  Then, 

it  X  t  t^.  we  assign  the  sample  to  class  The  probability  ot 

error  can  be  calculated  as  follows: 

E  =  Pr  (error)  =  Pr  (  error /wj }  P(  )  +  Pr  I  error/ wu )  P<  w  > ) 

It  the  sample  belongs  to  u>|.  an  error  occurs  whenever  X  t  h.  and. 
similarly,  it  the  sample  belongs  to  «>;>.  an  error  occurs  whenever  X  e 
I | .  Thus. 


t  =  Pr ( X  t  lv/w  )PU  >  +  Pr (X  e  1  /<».,)  P<«.,) 


r  I  . 


=  Pit,  )  /  0(X/w,  )dX  +  PU> 


.,)  J  pfX/to.,)  dX 


=  P(  (  +  P(  to.,  )  £ 


We  can  distinguish  two  types  ot  errors:  one  which  results  from 
misc iassitving  samples  from  t0|  and  the  other  which  results  from 
misc  lassitving  samples  from  The  total  error  is  a  weighted  sum 

ot  these  errors. 

The  problem  ot  calculating  the  probability  ot  error  is  solved 
essentially  by  the  integration  ot  density  functions  in  an  n-dimen- 
sionai  space.  Therefore,  it  is  sometimes  more  convenient  to  integrate 
the  density  tunction  ot  the  likelihood  ratio  d( i/wj ) .which  is 
one-dimensional . 

■  P(w.;)/P(w  ) 

t  =  /  “  1  p(l/w  )dl  (<(.8.1-4) 

1  J  U  1 

£.,=  I  p(  i/w  )dl 

'  J  P(w.,)/P<w  )  ’  L 

where  the  region  ot  integration  ot  (2.8. 1-4)  is  from  U  to  P(  u>v )  /(Pu>[) 
because  the  likelihood  ratio  is  always  positive. 
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In  Che  common  case  when  the  p(X/ui^)‘s  are  normal  with  expected 
vectors  and  covariance  matrices  2.^,  the  decision  rule  ot  (2.3. 1-3) 
becomes 


ht  X)  =  -  in  i(X> 


I  T  - 1  1  T  - 1  1 

—  (X  -  Mt>  l,  (X  -  M  > - j~  (X  -  M7)  l.,  (X  -  M.,  >  +  -y-  in  — 


1  ' 


$  ,  _!!V 

'  ln  P(w.,) 


X  € 


I 


(2.3. I -J) 


Equation  (2.3. 1-0)  shows  that  the  decision  boundary  is  given  by  a 
quadratic  torm  in  X. 

When  L,  =  1.,  =  i.  the  boundarv  becomes  a  linear  lunction  ot  x.  as 

1  2  l 

T  -I  I  T  -1  T  -I 

h(  X )  =  <M.,-  M  )  1  X  +  -7-  (M  2.  My  M.,  2.  M.,) 


<  ,  piv 

in  Pfu.,) 
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2.8.2  A  nonparamet ric  Method:  The  Empirical  Processing  Algorithm 
(EPA) 

2.8.2. 1  Origin  ot  the  EPA  and  the  Structure  ot  Its  Classitier 

2.8.2. 1.1  Background  and  Development  ot  the  EPA 

A  nonparamet ric  problem  in  two-class  pattern  recognition  was 
considered  by  Henrichon  ( 1  I  and  by  Loew  and  Fu  I10J:  one  approach 
involved  the  use  ot  the  empirical  distribution  tunction  as  an 
approximation  to  the  underlying  distribution  tunction.  A  principal 
result  obtained  there  in  the  case  ot  one-dimensional  observations 
(i.e..  the  feature  vector  has  only  one  element)  was  an  algorithm  tor 
determining  the  relative  extrema  ot  the  tunction: 

.  t  (  x  j  u>  j )  2  1.  x  £  u>  j 

t(  x|<o.,  >  <  1 .  x  e 


Here  l(x|wj>  and  t(x|u>y)  are  the  (assumed)  continuous  cumulative 
distibution  functions  (cdtss)  tor  the  populations  (classes)  and  . 

The  need  to  tind  the  relative  extrema  ot  (2-1)  was  motivated  by 
the  following  observations.  It  we  assume  equal  3  priori  class 
probabiiitites ,  and  equal  costs  ot  misciassitication,  then  the 
likelihood-ratio  test  yields  the  optimal  decision  boundaries  (those 
which  minimize  the  expected  risk).  In  the  two-class  case,  the 
resulting  decision  rule  is,  it 
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t<  XU{  ) 
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t<  xlw., ) 
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whe  re  t  (  x  1  .*< 
( pdt ' s  )  tor 
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[  )  and 
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3  S 

t(  Xlwj  ) 

classes . 

are  the  pr obabi 1 i t v  . dens i t v 
The  decision  boundary  ot  (  2-2 

functions 

)  can  then 

t!  x|wt  )  -  t<  x!w_. )  =  U 

But  Che  same  result  can  be  obtained  it  the  locations  ot  the  maxima  and 
minima  ot  (2-11  can  be  tound. 

The  algorithm  tor  tinding  those  extrema  makes  use  ot  the  empir¬ 
ical  cdt's  tor  the  two  classes,  and  assumes  that  they  are  both 
continuous  and  monotonicaliv  increasing.  The  empirical  cdt  is  detined 
as 

number  ot  )x,.  x.. .  x  l  <  x 

„  ....  _ _ L _ A _ IL _ 


In  I.  V  I  the  asymptotic  optimality  ot  the  algorithm  is  proved,  that  is, 
that  the  boundaries  obtained  trom  it  converge  to  the  optimal  ones 
which  would  be  determined  trom  <2-1 i).  In  addition,  expressions  tor 
the  probability  ot  misciassitications  are  tound. 

The  method  described  above  has  the  advantage  ot  requiring  little 
a  priori'  information:  it  assumes  that  the  underlying  distributions 
are  continuous,  which  in  most  cases  is  not  a  serious  restriction.  It 
is  able  to  deal  with  multimodal  distributions  and  rauiitpie  decision 
boundaries . 

Two  principal  disadvantages,  however,  are  noted  when  an  attempt 
is  made  to  use  the  method  in  multidimensional  and/or  multiclass 
problems,  as  would  typify  the  present  work.  We  observe  that,  tor  the 
muitidimensional-teature-vector  case.  the  multivariate  analog  ot  a 
tunction  composed  as  the  ditterence  ot  empirical  cdt’s  cannot  be 
stored  in  a  computer  in  a  torm  amenable  to  convenient  extrema 
determination.  In  the  n-ciass  case,  the  only  possibility  tor 
implementation  ot  the  algorithm  would  seem  to  be  the  committee 
solution  technique  III),  which  requires  all  two-class  comparisons  to 
be  made:  then  the  class  with  a  majority  ot  tavorabie  decisions  is 
selected.  For  large  problems  this  approach  might  not  be  acceptable, 
however,  since  n(n-l)/2  individual  two-class  classifiers  would  be 
required. 

A  method  is  proposed  in  IV.1UJ  which  is  an  attempt  at  avoiding 
the  ditticuities  listed  above.  Instead  ot  approximating  the 
ditterence  ot  distribution  functions,  the  procedure  partitions  the 
combined  rank  ordering  ot  the  multiclass  samples.  It  is  suitable  tor 
both  multidimensional  and  muiticiass  classification.  The  next  section 
presents  the  general  structure  ot  the  classifier,  and  Section  2. 8. 2. 2 
gives  the  method  and  an  algorithm  tor  determining  the  specifications 
ot  that  structure. 
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2.8.2. 1.2  The  General  Classitier  Structure 


The  basic  building  block  ot  the  classitier  is  called  the  basic 
selection  unit  (BSU).  and  as  many  BSU's  as  required  are  interconnected 
to  torm  the  classitier.  A  typical  BSU  is  shown  in  Figure  2-1.  The 
vector  F  is  a  1-dimensionai  teature  vector  to  be  ciassitied  by  the  BSU 
as  coming  trom  one  ot  the  m+1  pattern  classes  wy,  uij ....  .wm.  where  ,*,y 
denotes  indecision.  As  will  be  explained  in  Section  2 . 8 . 2 . 2 .  it  a 
decisoin  is  made  that  the  pattern  class  is  a  member  ot  1  wj , .  .  .  . 
the  procedure  stops:  it  the  BSU  is  unable  to  make  a  classification 
trom  that  set.  its  decision  is  <oy,  and  the  procedure  continues.  The 
BSU  allows  tor  these  two  possibilities  by  having  two  kinds  ot  output 
lines:  a  decision  line  3nd  a  set  ot  selection  lines.  It  <yy  is  the 
decision  made  by  a  particular  BSU.  then  one  ot  q  selection  lines, 

Si<i=l . a),  trom  that  BSU  will  become  active.  A  logical  one  (L) 

shall  be  used  to  indicate  an  active  state  ot  a  selection  line,  i.e., 
s =  1  :  and  a  logical  zero  (0)  tor  an  inactive  state  (s^=0).  We 
denote  by  s^  ^  ^  the  ith  selection  line  trom  the  jth  BSU  in  the  kth 
layer  ot  the  classitier  (see  Figure  2-2).  The  selection  line  performs 
the  function  ot  triggering  the  active  state  (1)  ot  a  BSU  in  the  next 
(i.e..  (k+l>st)  layer  ot  the  classitier.  Unless  the  selection  line 
entering  a  BSU  is  active  (s=l),  that  BSU  will  remain  inactive,  and  no 
processing  will  occur  within  it. 

The  decision  line  d ^ k  originates  at  the  jth  BSU  in  the  kth 
layer,  and  is  active  only  it  the  BSU  has  decided  that  the  pattern 
represented  by  F  is  an  element  ot  (  uij  ,  .  .  .  ,  .  Thus.  the 
allowable  states  tor  d , _ ^  are 


d 


i  .k 


i . 

0, 


it  class  v> .  is  decided.  i*l . m  (active  states) 

l 

it  class  is  decided  (inactive  state). 


The  BSU.  then,  serves  to  either  (1)  reach  a  decision  as  to  which 
ot  m  classes  should  be  assigned  to  the  input,  ot  (2) 
decide  which  subsequent  BSU  should  repeat  the  process.  For  any  given 
input  to  a  structure  ot  BSU's  there  can  be  at  most  one  BSU  which  has 
an  active  decision  line.  Other  constraints  on  BSU  operation  also 
follow  trom  the  structure  definitions  given  above,  and  can  be 
summarized  by  the  following  expressions. 


(  U  inactive  state, 

(  1 )  s .  .  ,  =  <  , 

l.i, k  j  1  active  state. 


ail  i, i .  and  k. 


(2)  V  s.  <  1,  all  i  and  k. 
i»l 


This  savs  that,  at  most,  one  seiction  line  oer  BSU  can  be  active. 
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*  u 
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The  decision  line  and  selection  Line  states  are  mutually  exclusive. 


4)  (  s.  +  d  ) 
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A  BSU  is  inhibited  unless  there  is  an  active  selection  line  input  to 
it  trom  the  previous  layer. 

Figure  2-i  indicates  the  conceptual  structure  ot  a  BSU.  Each  ot 
its  three  main  components  as  well  as  the  interconnection  scheme  is 
discussed  in  the  next  section. 

2. 8. 2. 2  Construction  ot  the  Classifier 

2. 8. 2. 2.1  Introduction 

The  three  subunits  ot  a  BSU.  as  shown  in  Figure  2-i.  are  the 
transgeneration  box  (optional),  which  augments  the  incoming  feature 
F  by  torming  new  features  which  are  combinations  ot  the  components  ot 
F:  the  component  selection  box  selects  one  ot  the  features  (original 
or  transgenerated)  as  an  input  x  to  the  threshold  unit.  The 
threshold  unit  activates  one  ot  the  a  selection  lines,  or  the  decision 
line,  depending  on  the  value  ot  x. 

In  the  following  discussions  the  decision  lines,  selection  line 
labels,  and  component  selection  box  will  be  omitted  trom  the  structure 
diagrams.  A  circled  number  next  to  a  terminal  region  in  the  threshold 
unit  shall  indicate  which  pattern  class  is  selected.  The  algorithm 
which  follows  determines  which  feature  component  x  should  be  selected 
in  each  BSU.  and  the  intervals,  corresponding  to  the  domain  ot  x, 
which  should  be  connected  to  the  selection  and  decision  lines. 

2. 8. 2. 2. 2  The  Algorithm 

Let  X*<X| . xnj>  a  set  °1  nl  independent  observations 

trom  class  [ .  and  let  Y  *  ty|,...,ynut  be  a  set  ot  nv 
independent  observations  trom  class  •  Let  K  and  9  be  two 
prespecitied  parameters  determined  trom  the  combined  sample  size 
n*nj+nu.  (Some  guidelines  tor  the  choices  ot  K  and  9  are 
discussed  below.  ) 

Step  t.  Order  the  combined  sample  set  X+Y  according  to 
increasing  numerical  value  to  form  an  ordered  set  Z.  Partition  the 
set  Z  into  successive  groups  ot  K  samples.  (See  Figures  2-4(a)  - 
2-4<  c  ) .  ) 

Step  2.  For  each  group  count  the  total  number  ot  x's  and  v's 
and  assign  a  class  label  as  follows: 
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“t  number  ot  x's  >  6.  .assign  class 

number  ot  v's  >  9.  assign  class  „>■< 
otherwise,  assign  class  wy 

Then  merge  adjacent  regions  which  were  assigned  the  same  class 
(Figures  2-4(d)  -  2-4(e)). 

Step  i.  Adjust  the  boundaries  by  perturbing  them  a  maximum  ot 
K/2  samples  in  either  direction  and  relocate  them  at  positions  where 
the  most  improvement  in  c la s s iticat ion  accuracy  is  obtained  (Figure 
_’-4(  t )  ) . 

Step  It  fewer  than  K/2  samples  remain  in  any  one  region, 

dissolve  that  region  and  place  its  samples  (preserving  the  rank  order) 
among  the  neighboring  two  regions  so  as  to  yield  the  least  increase  in 
misciassitication  (Figure  _’-0<g>). 

Step  i.  Repeat  Step  2.  For  this  final  partition  compute  the 
empirical  classification  probability,  or  SCORE,  as 

SCORE  =  number  ot  samples  correctly  classified. 

This  training  procedure  thus  yields  a  set  ot  thresholds  to  which 
the  (single)  feature  value  ot  an  unknown  input  sample  would  be 
compared:  its  class  would  then  be  assigned  according  to  the  label  ot 
the  region  along  the  "feature  axis'*  in  Figure  2-4(g)  in  which  it  tell. 

The  procedure  can  be  extended  to  the  case  ot  a  multidimensional 
feature  space  by  applying  the  algorithm  to  each  feature  separately, 
and  the  SCORE  (from  Step  o)  tor  each  component  is  recorded).  The 
dimension  associated  with  the  highest  SCORE  is  selected  as  the 
dominant  dimension.  The  observation  space  is  then  partitioned  by 
parallel  hvperplanes  which  have  the  dominant  dimension  as  the  common 
normal  vector  and  intersect  it  at  the  boundaries  determined  by  the 
algorithm  tor  this  feature.  For  each  region  formed  by  the  above 
partitions,  the  procedure  is  repeated  until  no  new  regions  are 
produced.  Each  time  one  or  more  regions  ot  class  (indecision)  in 
a  layer  ot  the  classifier  are  subdivided,  an  additional  layer  is 
added,  and  this  is  what  produces  the  selection  lines  ot  Figure  2-2. 

As  mentioned  in  Section  2 .  8 . 2 . 1 . 1 ,  one  ot  the  disadvantages  ot 
the  ext rema-dete rmina t ion  algorithm  was  that  the  extension  to  the 
muiticiass  case  would  involve  a  substantial  amount  ot  additional 
calculation. 

•  That  disadvantage  does  not  exist  tor  the  algorithm  just 
discussed:  the  extension  is  direct.  The  original  procedure  tor 
labeling  the  partitioned  regions  was:  it 

number  ot  x's  >  0.  assign  wj 
number  ot  v's  >  0.  assign 
otherwise,  assign  wy. 

This  step  can  be  reformulated  by  considering  n  pattern  classes 

. wn  (letting  u)(j  continue  to  denote  indecision).  Let 

represent  an  observation  whose  true  class  is  .  Then  the 
reformulated  assignment  rule  tor  each  block  is:  it 

number  ot  xwi's  >  9  tor  some  i,  assign 
otherwise.  assign  wy 


it 


The  classifier  structure  of  Figure  2-2  remains  as  it  was  in  the 
two-class  case. 


2. 8. 2. 2. 3  Comments  on  the  use  of  the  Algorithm 

The  choice  of  K  in  Step  1  of  the  algorithm  is  not  of  concern  in 
the  concept  of  the  procedure,  but  does  play  a  part  in  the 
implementation  of  the  method.  The  purpose  of  partitioning  the 
combined  sample  set  (of  size  n,  say)  is  to  reduce  the  number  of 
partitions  which  need  be  considered.  This  also  has  the  effect  of 
placing  an  upper  bound  on  the  complexity  of  the  classifier  at  the 
outset,  since  the  first  BSU  in  the  classifier  will  not  require  more 
than  K  thresholds,  and  subsequent  BSU's  (it  any)  will,  in  general, 
have  fewer  thresholds  than  the  first.  The  only  constraint  required  on 
K.  therefore,  is  that  it  does  not  increase  as  fast  as  n  does.  i.e.. 


In  the  exoeriments  described  below.  K  was  chosen  oroDortional  to 
\1/-. 

An  obvious  condition  on  the  value  of  8  used  in  Step  2  is  that 
0>K/2.  It  8  <  K/2.  then  conceivably  two  or  more  classes  would  satisfy 
the  inequality  governing  the  assignment  of  classes  to  the  groups  of 
samples,  resulting  in  an  ambiguous  procedure.  Henrichon  1.9  1  has 
empirically  found  that  good  results  are  obtained  with  0  =  0.6K  +  (A), 
where  0  <  A  i  b.  This  approximation,  with  various  values  of  A.  was 
used  in  the  experiments  presented  below. 

The  SCORE  computed  in  Step  b  of  the  algorithm  is  used  in  the 
multidimensional  case  to  determine  the  order  in  which  features  should 
be  chosen  by  the  component  selection  boxes  (Figure  2-3)  of  the  BSU's. 
As  specified  in  Step  b.  the  SCORE  was  simply  the  number  of  correctly 
classified  samples.  This  is  a  reasonable  approach  as  long  as  the 
number  of  training  samples  per  class  remains  the  same  from  feature  to 
feature.  In  some  situations,  however,  the  number  of  samples  available 
per  feature  varies.  A  straightforward  solution  in  that  case  is  to  use 
a  normalized  SCORE,  i.e.,  a  percentage.  Where  necessary,  then,  the 
SCORE  will  be  defined  as 


SCORE  «  100  x 


number  of  correctly  classified  samples 
total  number  of  samdes 


2.R.3  Examples 


To  test  the  design  process  tor  the  two  kinds  of  classifiers 
described  above,  we  must  use  data  from  some  real  cases.  Because  the 
segmentation  routine  is  still  under  development,  we  have  chosen  to  use 
some  data  that  describe  ultrasound  signals  acquired  during  examination 
of  human  livers.  The  goal  is  to  use  features  extracted  from  the 
signals  to  classify  3  liver  as  normal  or  abnormal  (in  this  case, 
hepatitis ) . 

Four  features  were  used: 
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(1)  d.  the  average  spacing  between  scatterers 

(2)  r.  the  ratio  ot  the  specular  to  the  dittuse  backscatter 
intensities 

(  i )  <JS.  the  ratio  ot  the  standard  deviation  ot  the  specular 
backscatter  to  the  dittuse  backscatter 

(4)  c< .  the  attenuation  coefficient. 

In  the  following.  the  tour  features  are  abbreviated, 
respectively,  as  D.R.  V.  and  A. 

The  classifier  designs  were  the  following: 

(1)  Parametric  case 

The  Baves  classifier  was  used  under  two  alternative 
cases:  (a)  that  the  covariance  matrices  were  different  tor  the 

different  classes  (resulting  in  a  nonlinear  boundary  in  feature 
space),  and  (b)  that  the  matrices  were  the  same  (resulting  in  a  linear 
boundary).  The  design  data  were  separated  by  class  for  estimation  ot 
the  covariance  matrices  in  the  first  case,  and  pooled  in  the  second 
case . 


ot 


(  .') 


K  and 


N’onparamet ric  Case 

The  design  algorithm  was 

e. 


followed. 


tor  various  values 


The  results  tor  all  ot  these  trials  are  presented  in  the  next 
section. 


2.9  Classifier  Testing 

The  probability  of  error  is  the  key  quantity  in  pattern 
recognition:  the  estimation  ot  that  quantity,  therefore,  deserves 
special  consideration.  There  are  two  kinds  ot  problems.  The  first  is 
the  estimation  ot  the  probability  ot  error  from  available  samples, 
assuming  that  a  classifier  is  given.  The  second  is  the  estimation  ot 
the  probability  ot  error  tor  given  distributions.  For  this  problem, 
the  probability  ot  error  depends  on  the  classifier  to  be  used  as  well 
as  on  the  distributions.  Therefore,  we  first  have  to  specify  the 
nature  ot  the  classifier  (e.g.,  the  Bayes  classifier  tor  minimum  error 
that  was  defined  above):  the  task  then  becomes  one  ot  finding  a  way  to 
use  available  samples  tor  designing  the  classifier  and  evaluating  the 
error.  Since  we  have  only  a  finite  number  ot  samples,  we  cannot 
design  the  optimum  classifier,  and  the  parameters  ot  the  ciassiter 
are.  therefore,  also  random  variables.  Furthermore,  based  on  this 
random  ciassiter.  we  have  to  estimate  the  probability  ot  error. 

It  we  assume  that  sufficient  data  are  available  tor  estimating 
accurately  the  class-conditional  density  (and  distribution)  functions 
ot  our  features,  then  classical  methods  exist  112J  ter  estimating  the 
probability  ot  error  from  N  samples  drawn  from  those  distributions. 
Those  methods  will  be  employed,  with  random  sampling,  to  estimate 
error  probabilities  and  their  confidence  intervals  tor  the  Army  data. 


To  continue  the  examples  presented  in  Sec.  _'.8,  however,  we  must 
use  the  second  ot  our  approaches,  namely,  that  tor  the  limited-data 
case.  When  N  samples  are  given  without  a  ciassitier  design,  we  have 
to  use  those  samples  to  design  a  ciassitier  as  well  as  to  test  it. 
The  probability  ot  error  to  be  estimated  depends  on  the  given 
distributions  and  on  the  ciassitier  to  be  used.  A  number  ot  usetui 
theoretical  results  have  been  obtained  I1JJ  tor  the  case  ot  the  Bayes 
ciassitier  tor  minimum  error:  we  will  use  those  results  not  only  tor 
the  Bayes  classifiers  but  also  tor  the  nonparametric  ciassitiers.  all 
as  described  in  Sec.  3.8. 

We  described  two  approaches:  (1)  N  samples  are  used  to  design 
the  Bayes  ciassitier  and  the  same  N  samples  are  tested.  This  method 
has  been  shown  1111  to  yield  an  optimistic  bias  ot  the  probability  ot 
error;  (3)  N  samples  are  used  to  design  the  Bayes  ciassitier,  and  the 
samples  trom  the  true  distributions  are  used  tor  testing.  This  method 
also  yields  a  biased  estimate  ot  the  probability  ot  error,  but  the 
bias  is  such  that  the  expected  value  is  an  upper  bound.  The  samples 
trom  the  true  distribution,  however,  may  be  replaced  by  the  samples 
which  are  not  used  to  design  the  ciassitier  and  which  are  independent 
ot  them.  As  the  number  ot  test  samples  increases,  the  distributions 
ot  the  test  samples  tend  toward  the  true  distributions. 

There  are  several  ways  to  realize  this  second  approach.  .  The 
tirst  is  to  divide  available  samples  into  two  groups  3nd  use  one  ot 
them  tor  designing  the  ciassitier  and  the  other  tor  testing.  The 
second  is  a  refinement  ot  the  tirst:  we  take  out  one  sample,  design  a 
ciassitier  by  using  N-i  samples,  and  test  the  unused  sample.  This  is 
called  the  ieaving-one-out  method.  This  operation  is  repeated  N  times 
and  the  number  ot  misclassitied  samples  is  counted.  The  proportion  ot 
the  total  that  that  number  represents  is  then  the  estimate  ot  the 
probability  ot  error.  A  disadvantage  ot  this  method  is  that  V 
ciassitiers  must  be  designed.  For  the  nonparametric  case,  however,  it 
is  the  only  ettective  conservative  approach. 

Appendix  B  contains  contusion  matrices  tor  the  examples  ot  Sec. 

I. 8.  computed  using  both  methods  —  testing  using  the  design  set.  and 
leaving-one-out . 

J.  Results 

.  We  now  consider  the  results  ot  applying  these  ciassitiers  to  the 
two  kinds  ot  liver  disease.  Three  different  kinds  ot  ciassitiers  are 
used  tor  each  ot  the  tour  features.  In  addition,  ail  possible  subsets 
ot  the  tour  features  are  used  to  evaluate  the  performance  ot  multiple 
measurements.  In  ail  cases,  a  standard  format  is  used:  the  contusion 
matrix.  The  contusion  matrix  has  as  its  rows  the  names  ot  the  two 
correct  classes  and  as  its  columns  the  names  ot  the  classes  ot  the 
decisions  made  by  the  ciassitier.  Hence,  an  ideal  ciassitier  would 
have  all  entries  on  the  diagonal,  indicating  perfect  classitication. 
We  compute  error  probability  as  the  total  number  ot  misclassitied 
samples  divided  by  the  total  number  ot  samples. 

The  three  ciassitier  cases  we  consider  are:  the  ieave-one-out 
case  with  the  Bayes  rule  tor  normally  distributed  data  with  equal 
covariance  matrices.  (Note  that  the  fourth  equality  ot  the  covar¬ 
iance  matrices  by  pooling  the  data  tor  the  covariance  calculation.): 
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Che  second  case  keeps  Che  two  classes  separaCe  and  computes  covariance 
matrices  individually.  Again,  che  terms  “linear*1  and  “nonlinear1'  are 
used  because  ot  Che  nature  ot  the  decision  boundary  that  results  trom 
the  two  kinds  ot  covariance  matrices.  The  third  kind  ot  classifier 
again  uses  two  separate  ly-computed  covariance  matrices  but  now  the 
design  set  ot  data  is  also  used  as  the  testing  set.  As  was  noted 
earlier,  this  yields  an  optimistic  estimate  ot  error  probability  tor 
che  classifier. 

Appendix  B  presents  the  three  contusion  matrices  tor  each  feature 
and  combination  ot  features.  A  total  ot  4/  samples  was  used:  eighteen 
trom  the  normal  class  and  twenty-nine  trom  the  abnormal.  Note  che 
relatively  high  probabilities  ot  error  tor  all  three  classifiers  tor 
certain  features  and  groups  ot  features.  These  relatively  high  values 
are  due  in  part  to  the  error  associated  with  estimating  parameters  ot 
Che  probability  density  function  ot  a  small  number  ot  samples.  We  may 
contrast  these  results  with  the  second  set  shown  in  Appendix  B:  here 
we  have  /V  samples  total.  Note  that  the  errors  are  in  general 
smaller. 

In  any  pattern  recognition  problem,  feature  selection  is  a  very 
important  step.  From  a  set  ot  candidate  features  —  even  after  they 
have  been  de-correlated  --  one  generally  seeks  to  use  the  smallest 
number  sufficient  to  achieve  the  desired  error  probability.  Several 
recent  results  I  14,151  make  it  clear  that  in  order  to  choose  the  best 
subset  ot  any  given  size  ot  an  initial  candidate  set  ot  features,  one 
must  examine  all  possible  subsets.  In  tact,  it  is  possible  to  do 
arbitrarily  badly  it  the  search  is  non-exhaustive .  In  light  ot  those 
results,  we  examined  the  fifteen  possible  subsets  ot  our  tour  fea¬ 
tures.  Figures  1  and  2  in  Appendix  B  illustrate  the  variation  in 
probability  ot  error  tor  different  kinds  ot  classifiers  when  different 
subsets  ot  features  are  used.  For  example,  in  the  case  in  which  we 
have  4/  samples  all  together,  we  note  that  the  two  best  features  using 
the  nonlinear  round-robin  or  leave-one-out  classifier  were  R  and  D. 
but  the  best  set  ot  two  features  were  D  and  V.  This  illustrates  the 
tact  that  in  general,  the  best  two  features  are  not  necessarily  the 
two  best.  Note  also  that  it  we  instead  use  the  linear  leave-one-out 
classifier  shown  in  the  left  ot  each  set  ot  three  bars  in  the  figure, 
that  the  two  best  features  are  D  and  V,  and  that  the  best  are  also  D 
and  V.  Thus  we  may  generalize  to  say  trom  this  example  that  the  best 
two  are  not  necessarily  the  two  best.  That  is  true  as  well  in  the 
case  ot  three-at-a-time.  where  the  best  three  are  A,  R  and  D,  but  the 
three  best  are  D,  R.  and  V.  It  therefore  is  important  to  perform  an 
exhaustive  search  ot  ail  possible  subsets  it  the  goal  is  to  find  the 
best  performance  at  a  given  number  ot  features.  In  the  /V-sample 
case,  in  both  ot  the  leave-one-out  cases  (the  left  bar  and  center  bar 
ot  each  triplet)  we  see  that  the  two  best  features  are  V  and  D  and 
that  the  best  two  features  are  also  D  and  V.  Note  also  that  the  best 
triplets,  D,R,V  and  D,V,A.  are  both  composed  ot  the  best  pair.  D,V.  In 
the  previous  data  set  (the  4/-sampie  case)  the  best  triplet,  D,R,V  as 
measured  by  the  nonlinear  leave-one-out  method  was  indeed  composed  ot 
the  best  two  (D.V):  but  when  we  examine  the  linear  leave-one-out  case, 
there  is  a  tie  tor  the  best  set  (D,V,A  and  D,R,A),  the  latter  ot  which 
is  composed  neither  ot  the  best  two  nor  ot  the  second  best  two. 


We  ace  led  to  conclude  that  any  kind  ot  step-wise  search  tor  a 
best  subset  ot  features  will  not  in  general  produce  optimal  results. 
This  experimental  conclusion  supports  the  theoretical  results  cited 
earlier. 

The  non-parametric  procedure  was  applied  to  the  set  ot  /'a 
samples.  The  contustion  matrices  that  resulted  are  shown  in  Appendix 
B.  Notice  that  tor  the  features  used  individually,  i.e..  stopping  the 
process  at  the  first  stage,  yields  results  not  tar  different  from  what 
we  achieved  earlier.  In  the  case,  however,  where  we  had  two  features 
being  used,  that  is.  either  D  and  V  or  D  and  R.  our  performance 
improved  considerably  (0.14  error  probability).  Although  the  method 
has  not  yet  been  tested  with  the  ieave-one-out  procedure,  that 
programming  is  nearly  complete  and  is  expected  to  yield  similar 
results . 

4.  Conclusions 

Work  to  date  has  yielded  a  set  ot  tools  that  will  now  work  well 
once  features  have  been  extracted  from  the  actual  time-series  data 
that  we  have.  The  experimental  results  on  the  liver  ultrasound  data 
are  verv  encouraging  and  we  believe  that  there  wiii  be  good  perfor¬ 
mance  once  the  physiological  data  are  analyzed.  This  set  ot  toots, 
along  with  the  segmentation  routines,  should  yield  results  early  in 
the  third  year  ot  this  study. 

b.  Plans  tor  the  Coaine  Year 

During  the  coming  year  we  expect  to  segment  ail  ot  the  time- 
series  data,  independently  tor  both  the  activity  and  the  heart-rate 
data.  Separately,  we  will  use  a  hierarchical  approach,  referred  to 
earlier  in  this  report,  in  which  we  wiii  allow  the  segments  apparent 
in  the  activity  data  to  guide  the  segmentation  ot  the  heart-rate  data. 
Again,  the  rationale  tor  this  is  that  the  activity  data  are  in  general 
cleaner  than  the  heart-rate  data.  Once  a  set  ot  segments  has  been 
established  the  features  that  are  extracted  will  be  submitted  to  the 
classifiers  that  have  been  described  above.  The  classification 
accuracy,  concordance  with  physiological  truth  (as  supplied  by  the 
Army),  the  number  ot  features,  and  their  ease  ot  extraction  all  wiii 
be  evaluated  with  the  coal  ot  eventually  constructing  a  very  simple 
signal-processing  system  tor  determination  ot  physiological  state. 
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Plots  ot  ECG  data  before  and  after  preprocessing  and  noise  removal 
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APPENDIX  F 


Classifier  Variable  Definition 


The  variables  are: 


Jmax  =  number  ot  components  in  the  feature  vector 

S'  =  number  ot  observations 

=  observation  point  in  Jmax-space 

D(  J )  =  array  ot  the  N  observations  ranked  according  to  the  Jth 
complement 

Ri  =  a  region  in  Jmax-space 
Ci  =  classification  assigned  to 

IN’  ^  =  rank  ot  first  point  in  DM)  which  is  contained  in  R^ 

FN =  rank  ot  last  point  in  D(  J )  which  is  contained  in  R^ 

K  =  the  size  ot  the  blocks  into  which  the  teature-value  axis  is 
initially  partitioned 

Imax  =  number  ot  new  regions  formed 

S( . )  =  stored  value  ot  . 


